The goal of this project is to create a machine learning model that can predict whether a patient will die of heart failure based on patient history and vital signs.
Although the name suggests that the heart has stopped, this is not the case. Heart failure, also known as congestive heart failure, is a serious, incurable condition in which the heart does not work properly and fails to pump enough blood to meet the body’s needs. Heart failure may occur if the heart can’t fill up with enough blood or if the heart is simply too weak to pump properly.
According to the Centers for Disease Control and Prevention, more than 6 million adults in the United States suffer from heart failure.
According to the National Heart, Lung, and Blood Institute (NHLBI), “Heart failure may not cause symptoms right away. But eventually, you may feel tired and short of breath and notice fluid buildup in your lower body, around your stomach, or your neck.” Heart failure can also eventually cause damage to other organs such as the liver or kidneys and lead to other conditions such as pulmonary hypertension, heart valve disease, and sudden cardiac arrest.
Although heart failure is incurable, the Mayo Clinic states that “Proper treatment can improve the signs and symptoms of heart failure and may help some people live longer,” and that “Lifestyle changes - such as losing weight, exercising, and managing stress - can improve your quality of life” (“Heart Failure” 2021).
Although heart failure may be incurable, it could still be beneficial for medical professionals to predict whether a patient may develop and potentially die from heart failure. For example, if a doctor can determine with high probability that a patient may develop heart failure later in life, they may be able to inform the patient so that they can make lifestyle changes early enough to prevent the most significant symptoms.
Additionally, although the body initially tries to compensate for heart failure through mechanisms such as enlarging the heart, developing more muscle mass, or pumping faster, these solutions are all temporary, and heart failure will simply progress until the onset of more serious symptoms such as fatigue or breathing problems. Since treatment can often slow the progression of heart failure, a machine learning model that could successfully predict a person’s chances of dying from heart failure could help catch more cases early and slow the progression of the disease.
Since the data set I will use includes deaths as a result of heart failure, creating an effective machine learning model out of this data set would also allow doctors to preemptively begin treatment that may prevent the patient from dying due to heart failure.
This data set was assembled as part of a study of heart failure patients who were admitted to the Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, between April and December 2015 (Ahmad T 2017). All patients in the study had left-ventricular systolic dysfunction, meaning that the left ventricle was unable to contract vigorously, which indicates a pumping problem (“Heart Failure” 2021). Furthermore, all patients in the study fell into New York Heart Association (NYHA) Functional Classification levels III and IV.
We will begin by loading the packages used in this project and reading the raw heart failure data into the variable heartfailure_data.
# Loading in libraries we will be using
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(knitr)
library(corrplot)
library(ggthemes)
library(gt)
library(gtExtras)
library(visdat)
library(fastDummies)
tidymodels_prefer()
# Read the raw data into a data frame.
heartfailure_data <- read_csv("heart_failure_clinical_records_dataset.csv")
# Show the first six rows as a formatted table.
head(heartfailure_data) %>%
  gt() %>%
  gt_theme_nytimes() %>%
  tab_header("Heart Failure Data")
Heart Failure Data

| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 0 | 582 | 0 | 20 | 1 | 265000 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 55 | 0 | 7861 | 0 | 38 | 0 | 263358 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 65 | 0 | 146 | 0 | 20 | 0 | 162000 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 50 | 1 | 111 | 0 | 20 | 0 | 210000 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 65 | 1 | 160 | 1 | 20 | 0 | 327000 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| 90 | 1 | 47 | 0 | 40 | 1 | 204000 | 2.1 | 132 | 1 | 1 | 8 | 1 |
The data was obtained from the Kaggle data set “Heart Failure Prediction”, with the original data coming from a study conducted by Tanvir Ahmad, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza.
We can now look at some basic information about the size of our data set:
dim(heartfailure_data)
## [1] 299 13
We can see that our data set has 299 observations and 13 variables. Let us now take a look at a summary of our variables:
vis_dat(heartfailure_data)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
## Please report the issue at <https://github.com/ropensci/visdat/issues>.
We can see that our data does not include any missing values, so that is not something we need to worry about. We also see that every column is stored as numeric, even though anaemia, diabetes, high_blood_pressure, sex, smoking, and DEATH_EVENT are binary, so we will convert those columns to factors.
# Converting categorical variables to factors.
# "1" is listed first for DEATH_EVENT so that a death is the first factor level.
heartfailure_data <- heartfailure_data %>%
  mutate(DEATH_EVENT = factor(DEATH_EVENT, levels = c("1", "0")),
         anaemia = factor(anaemia),
         diabetes = factor(diabetes),
         high_blood_pressure = factor(high_blood_pressure),
         sex = factor(sex),
         smoking = factor(smoking))
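The missing-value and type checks described above can also be done numerically. The following is a minimal base R sketch, not the report's own method: it uses a hypothetical data frame named toy holding a few columns from the six rows shown in the table earlier, counts missing values per column, detects which numeric columns contain only 0 and 1, and converts those to factors.

```r
# A few columns from the first six rows of the table shown earlier.
# `toy` is a hypothetical name used only for this illustration.
toy <- data.frame(
  age          = c(75, 55, 65, 50, 65, 90),
  anaemia      = c(0, 0, 0, 1, 1, 1),
  diabetes     = c(0, 0, 0, 0, 1, 0),
  serum_sodium = c(130, 136, 129, 137, 116, 132),
  DEATH_EVENT  = c(1, 1, 1, 1, 1, 1)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Numeric columns whose values are all 0 or 1 are really binary indicators
is_binary <- vapply(toy, function(col) all(col %in% c(0, 1)), logical(1))
binary_columns <- names(toy)[is_binary]

# Convert the binary columns to factors and leave the rest numeric
toy[binary_columns] <- lapply(toy[binary_columns], factor)

print(missing_per_column)
print(binary_columns)
str(toy)
```

Detecting binary columns from their values is only a heuristic; a column that happens to contain only 0 and 1 in a small sample may still be a genuine count, so the result should be checked against the data dictionary.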
When looking at information about the original data set, I also noticed that time records either the number of days until the patient died or the number of days until the patient was censored, meaning that follow-up ended without a death being recorded. Because this value reflects how long each patient was observed rather than the patient’s condition, it would be hard to interpret as a predictor, so I will remove it from the data set used for the machine learning models.
heartfailure_data <- heartfailure_data %>% select(-time)
Now we are left with the following variables, which will be used for the machine learning models:

- age: The age of the patient.
- anaemia: 1 if the patient was anaemic (haematocrit level lower than 36%), 0 if the patient was not anaemic.
- creatinine_phosphokinase: The amount of creatine phosphokinase (CPK) in the blood. CPK is often released into the blood when muscle tissue gets damaged.
- diabetes: 1 if the patient has diabetes, 0 if the patient does not have diabetes.
- ejection_fraction: The percentage of blood the left ventricle pumps out with each contraction.
- high_blood_pressure: 1 if the patient has high blood pressure, 0 if the patient does not.
- platelets: Result of a platelet count, which measures the number of platelets in the blood.
- serum_creatinine: Creatinine level in the blood. High serum creatinine levels indicate that the kidneys may not be functioning properly (Kenny 2010).
- serum_sodium: Result of a blood sodium test. Low serum sodium levels may be an indicator of heart failure (Case-Lo 2018).
- sex: 1 if the patient is male, 0 if the patient is female.
- smoking: 1 if the patient smokes, 0 if the patient does not smoke.
- DEATH_EVENT: 1 if the patient died during the course of the study, 0 if the patient did not die during the course of the study.

We will first look at the distribution of heart failure deaths.
# DEATH_EVENT is already a factor, so a constant bar colour is set directly in geom_bar().
ggplot(heartfailure_data, aes(x = DEATH_EVENT)) +
  geom_bar(fill = "pink") +
  labs(title = "Distribution of Heart Failure Deaths", x = "Death Event", y = "Count")
From the bar chart, we see that most of the patients did not die during the study. Of the 299 patients, 96 (32.1%) died and 203 (67.9%) did not.
To see whether any of our numeric variables are correlated, we will now make a correlation heat map of the numeric predictors.
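The percentages quoted above follow directly from the two counts. As a small worked check in base R (the vector name deaths is chosen here only for illustration):

```r
# Counts quoted in the text above
deaths <- c(died = 96, survived = 203)

# Total number of patients and the share in each group, as percentages
total <- sum(deaths)
percent <- round(100 * deaths / total, 1)

print(total)
print(percent)
```

Roughly one patient in three died, so the two outcome classes are moderately imbalanced, which is worth keeping in mind when the data is split and when model performance is judged.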
# Correlation matrix of the numeric columns, drawn as a heat map with coefficients.
heartfailcorr <- heartfailure_data %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(order = "AOE", addCoef.col = "black")
From the heat map, we see that there is little to no correlation between any pair of numeric variables. At first it may seem suspicious that none of the variables are really correlated with each other, but on closer inspection the result makes sense. With the exception of age, every value comes from a test that the doctors performed. Since each test measures a different aspect of the patient’s health, there is no reason to expect the results of any two tests to correlate strongly with one another.
It is often the case that older patients are more likely to die from medical conditions that may arise. From the box plot of age by outcome, we see that with heart failure this is indeed the case: patients who died over the course of the study tended to be noticeably older than patients who did not. The median age of the patients who died and the 75th percentile of age for the patients who did not die were both 65. The median age of the patients who did not die was 60.
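A heat map is easiest to trust when the numbers behind it can be inspected as well. The sketch below is an illustration on simulated data rather than the study data: it builds a hypothetical data frame toy in which column d is deliberately constructed to correlate with column a, computes the correlation matrix with cor(), and then finds the strongest off-diagonal pair.

```r
# Simulated data for illustration only; none of these values come from the study
set.seed(1)
toy <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
toy$d <- toy$a + rnorm(200, sd = 0.5)  # d is built to correlate with a

# Pairwise Pearson correlations
cor_matrix <- cor(toy)

# Blank out the diagonal (always 1) before searching for the strongest pair
off_diag <- cor_matrix
diag(off_diag) <- NA
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest_pair <- c(rownames(cor_matrix)[idx["row"]], colnames(cor_matrix)[idx["col"]])

print(round(cor_matrix, 2))
print(strongest_pair)
```

Applied to the numeric columns of the real data, the same steps would report the largest correlation in the heat map as a single number.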
ggplot(data = heartfailure_data, aes(x = age, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_boxplot() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Age Distribution of Patients who Lived/Died during Study", x = "Age", fill = "Death Event")
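The medians and quartiles that a box plot draws can also be computed directly. The following base R sketch uses made-up ages in a hypothetical data frame toy, so the numbers it prints illustrate the method only and are not results from the study.

```r
# Made-up ages for illustration only; these are not rows from the study
toy <- data.frame(
  age = c(60, 65, 75, 50, 58, 62, 70),
  DEATH_EVENT = factor(c(1, 1, 1, 0, 0, 0, 0), levels = c(1, 0))
)

# Median age within each outcome group
median_age <- tapply(toy$age, toy$DEATH_EVENT, median)

# Quartiles within each outcome group, which are the edges and centre line of each box
quartiles <- tapply(toy$age, toy$DEATH_EVENT, quantile, probs = c(0.25, 0.5, 0.75))

print(median_age)
print(quartiles)
```

Replacing toy with heartfailure_data would give the group summaries for the real data.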
This data set comprises 105 women and 194 men. We also find that 31.96% of the men and 32.38% of the women in the data set died during the study.
# Bar plot of the number of patients of each sex, filled by outcome
sex_renamed <- heartfailure_data
levels(sex_renamed$sex)[levels(sex_renamed$sex) == "0"] <- "Female"
levels(sex_renamed$sex)[levels(sex_renamed$sex) == "1"] <- "Male"
ggplot(data = sex_renamed, aes(x = sex, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_bar() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Distribution of Sex", x = "Sex", y = "Count", fill = "Death Event")
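The death rates by sex quoted above can be reproduced from a two-way table of counts. In the sketch below the group totals (105 women, 194 men) come from the text, while the number of deaths in each group (34 and 62) is an assumption back-calculated from the quoted percentages rather than a figure stated in the source.

```r
# Rows: sex, columns: outcome.
# Group totals (105 and 194) are from the text; the split into 34 and 62 deaths
# is back-calculated from the quoted percentages and is an assumption.
counts <- matrix(
  c(34, 71,
    62, 132),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Female", "Male"),
                  outcome = c("Died", "Did not die"))
)

# Share of each outcome within each sex, as percentages
row_percent <- round(100 * prop.table(counts, margin = 1), 2)

print(rowSums(counts))
print(row_percent)
```

With the real data, a call such as table(sex_renamed$sex, sex_renamed$DEATH_EVENT) would produce the counts directly.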
Creatine phosphokinase (CPK) is often released into the blood when muscle tissue gets damaged, making it a relatively good metric upon which heart failure can be diagnosed (Roth 2019). Depending on factors such as age, gender, and activity level, 24-204 U/L is generally considered normal (“Creatinine Kinase,” n.d.). Since all patients in the study were already experiencing heart failure and fell into NYHA classification levels III and IV, we would expect their CPK levels to be higher than those of the general population. We find that the average CPK level was 670 for patients who died and 540 for those who did not.
ggplot(data = heartfailure_data, aes(x = creatinine_phosphokinase, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_boxplot() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Creatine Phosphokinase Distribution of Patients", x = "Creatine Phosphokinase Levels (IU/L)", fill = "Death Event")
Ejection fraction is a measure of the percentage of blood that the ventricles pump out with each contraction. A lower fraction means that the heart is having difficulty keeping up with the body’s needs and is a strong indication of heart failure. A healthy man would have an ejection fraction in the range of 52%-72% and a healthy woman in the range of 54%-74%, with lower values indicating progressively worse conditions (“Ejection Fraction” 2022). Since all patients were in NYHA classification level III or IV, we would expect their ejection fractions to be lower than those of the general population. Beyond that, we also see that the distribution of ejection fraction for patients who died is lower than that of patients who did not die during the study.
ggplot(data = heartfailure_data, aes(x = ejection_fraction, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_boxplot() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Ejection Fraction Distribution of Patients", x = "Ejection Fraction (%)", fill = "Death Event")
A normal platelet count ranges from 150,000 to 450,000 platelets per microliter of blood (Williams, n.d.). Research has shown that thrombocytopaenia, or a low platelet count, may be linked to higher rates of all-cause mortality (Mojadidi MK 2018). From the plots, it appears that most patients fall within the normal platelet range, and no difference between the two groups is immediately noticeable by eye.
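options(scipen = 999)  # Show platelet counts in full rather than in scientific notation
ggplot(data = heartfailure_data, aes(x = platelets, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_boxplot() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Platelet Count Distribution of Patients", x = "Platelet Count (platelets/mcL)", fill = "Death Event")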
Creatinine is a waste product of creatine, a chemical made by the body that is used to supply energy to muscles, and serum creatinine is its level in the blood. Normal results are 0.7 to 1.3 mg/dL for men and 0.6 to 1.1 mg/dL for women (Dugdale 2021). Serum creatinine above 1.5 mg/dL is associated with renal failure (Ahmad T 2017). There is evidence that chronic kidney disease contributes to cardiac damage and that heart failure is also a major cause of chronic kidney disease (Silverberg D 2004). We see from the data that patients who died tended to have higher serum creatinine levels, which suggests that their kidneys were likely performing sub-optimally.
ggplot(data = heartfailure_data, aes(x = serum_creatinine, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_boxplot() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Serum Creatinine Distribution of Patients", x = "Serum Creatinine (mg/dL)", fill = "Death Event")
The normal range for serum sodium is between 135 and 145 milliequivalents per liter (mEq/L) (“Hyponatremia” 2022). Low serum sodium, called hyponatremia, is an indicator of heart failure, because heart failure can cause fluid to accumulate in the body, which dilutes the sodium in the blood. It appears that, in general, patients who died during the study tended to have lower serum sodium levels than patients who did not die, which may indicate further progression of heart failure.
ggplot(data = heartfailure_data, aes(x = serum_sodium, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_boxplot() +
  # Name the breaks and colours explicitly so each label matches its factor level.
  scale_fill_manual(breaks = c("0", "1"), labels = c("Patient Did Not Die", "Patient Died"), values = c("0" = "lightblue", "1" = "pink")) +
  labs(title = "Serum Sodium Distribution of Patients", x = "Serum Sodium (mEq/L)", fill = "Death Event")
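The reference ranges quoted in the preceding paragraphs can be turned into simple flags. The sketch below is one possible approach rather than part of the original analysis: it applies the quoted ranges for CPK, platelets, and serum sodium, plus the 1.5 mg/dL serum creatinine threshold, to the six rows shown in the table earlier. The names labs_head, reference, and in_range are hypothetical helpers introduced here.

```r
# Lab values from the first six rows of the table shown earlier
labs_head <- data.frame(
  creatinine_phosphokinase = c(582, 7861, 146, 111, 160, 47),
  platelets                = c(265000, 263358, 162000, 210000, 327000, 204000),
  serum_creatinine         = c(1.9, 1.1, 1.3, 1.9, 2.7, 2.1),
  serum_sodium             = c(130, 136, 129, 137, 116, 132)
)

# Normal ranges quoted in the text (lower limit, upper limit)
reference <- list(
  creatinine_phosphokinase = c(24, 204),
  platelets                = c(150000, 450000),
  serum_sodium             = c(135, 145)
)

# TRUE where a value lies inside its reference range
in_range <- function(x, limits) x >= limits[1] & x <= limits[2]
flags <- as.data.frame(Map(in_range, labs_head[names(reference)], reference))

# Serum creatinine above 1.5 mg/dL is the threshold the text links to renal failure
flags$high_serum_creatinine <- labs_head$serum_creatinine > 1.5

print(flags)
print(colSums(flags))
```

Flags like these are a descriptive aid only; the models later in the report use the raw numeric values.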
Having explored the data, we now have a sense of how the variables are distributed in relation to heart failure deaths. We can now move on to the creation of our models. We must first randomly split our data into training and testing sets, create our recipe, and set up cross validation to assist with our models.
The first step towards creating our models is to split our data into two groups: the training set and the testing set. The model learns from the training set which characteristics contribute to a heart failure patient’s death, and the held-out testing set lets us gauge how well the model performs on data it has not seen, so that over-fitting does not go unnoticed. To keep our results consistent, we will also set a seed so that the data is split the same way every time. Finally, to ensure that both sets have a similar proportion of deaths, we will stratify our split on DEATH_EVENT.
set.seed(12) # Set seed to keep split consistent each time
# Creating initial 75/25 data split, stratified on the outcome
heartfailure_split <- heartfailure_data %>%
  initial_split(prop = 0.75, strata = DEATH_EVENT)
heartfailure_train <- training(heartfailure_split) # Setting up training set
heartfailure_test <- testing(heartfailure_split) # Setting up testing set
dim(heartfailure_train)
## [1] 224 12
dim(heartfailure_test)
## [1] 75 12
After the split, there are 224 observations in the training set and 75 observations in the testing set.
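One way to confirm that stratification did its job is to compare the death rate in each piece of the split. The sketch below assumes the rsample package is installed and uses a hypothetical stand-in data frame toy with the same class counts as the study (96 deaths, 203 survivors) instead of the real file; death_rate is a helper defined here for illustration.

```r
library(rsample)

set.seed(12)

# Stand-in data with the same class counts as the study; `id` is a placeholder column
toy <- data.frame(
  id = seq_len(299),
  DEATH_EVENT = factor(rep(c("1", "0"), times = c(96, 203)), levels = c("1", "0"))
)

# 75/25 split, stratified on the outcome
toy_split <- initial_split(toy, prop = 0.75, strata = DEATH_EVENT)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of deaths in a data frame
death_rate <- function(df) mean(df$DEATH_EVENT == "1")

print(c(train_rows = nrow(toy_train), test_rows = nrow(toy_test)))
print(c(all = death_rate(toy), train = death_rate(toy_train), test = death_rate(toy_test)))
```

If the three rates are close to one another, the training and testing sets reflect the overall class balance.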
We can now put our predictors and response variable together to build the recipe that we will use throughout all our models. Since most models use essentially the same predictors and conditions, we will create one recipe that can be shared among the different models and adjusted where a particular model requires it. The recipe contains the preprocessing steps that we need in order to fit our machine learning models.
For our recipe, we will use the 11 predictors that we previously mentioned: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, and smoking.
For the recipe to work, we will also have to convert our categorical variables into dummy variables. These variables are anaemia, diabetes, high_blood_pressure, sex, and smoking.
We will also have to normalize our variables by centering and scaling the predictors.
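# Build the recipe on the training set so that the testing set stays untouched.
heartfailure_recipe <- recipe(DEATH_EVENT ~ ., data = heartfailure_train) %>%
  step_dummy(all_nominal_predictors()) %>% # Binary factors become dummy columns
  step_center(all_predictors()) %>%
  step_scale(all_predictors())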
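To see what these steps actually produce, a recipe can be prepared and baked on a small data frame. The sketch below assumes the recipes package is installed. The age, serum_sodium, and anaemia values are taken from the six rows shown in the table earlier, but the outcome column is made up for illustration, and toy, toy_recipe, prepped, and baked are hypothetical names.

```r
library(recipes)

# Predictor values from the first six table rows; the outcome column is made up
toy <- data.frame(
  age          = c(75, 55, 65, 50, 65, 90),
  serum_sodium = c(130, 136, 129, 137, 116, 132),
  anaemia      = factor(c(0, 0, 0, 1, 1, 1)),
  outcome      = factor(c(1, 0, 1, 0, 1, 0), levels = c(1, 0))
)

# Same three steps as the recipe above, applied one at a time
toy_recipe <- recipe(outcome ~ ., data = toy)
toy_recipe <- step_dummy(toy_recipe, all_nominal_predictors())
toy_recipe <- step_center(toy_recipe, all_predictors())
toy_recipe <- step_scale(toy_recipe, all_predictors())

# prep() estimates the means and standard deviations; bake() applies them
prepped <- prep(toy_recipe, training = toy)
baked <- bake(prepped, new_data = NULL)

print(baked)
print(round(colMeans(baked[, c("age", "serum_sodium")]), 10))
```

After baking, the factor predictor has been replaced by a numeric dummy column, and every predictor has mean 0 and standard deviation 1 on the data used for preparation.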
Now that we have all the initial setup completed, it is time for us to finally build our models.